Skip to content

ENH: change BlockManager pickle format to work with dup items #7370

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 1, 2014

Conversation

immerrr
Copy link
Contributor

@immerrr immerrr commented Jun 6, 2014

Old BlockManager pickle format stored items which were ambiguous and not enough for unpickling if non-unique.

After recent BlockManager overhaul, #6745, it's now possible to it's no longer necessary to share items/ref-items between Blocks and their respective managers, so Blocks can now be safely pickled/unpickled on their own, but Blocks still cannot be safely pickled/unpickled on their own, because they a) are part of internal API and are subject to change and b) were never intended to be serialized like that and their setstate is written in non-forward-compatible manner.

So, I've took the freedom to make the format more "extendable" by storing the state as the fourth element in BlockManager state tuple:

(axes_array, block_values, block_items,
 {
     "0.14.1": {
         "axes": axes_array,
         "blocks": [{"values": blk0.values,
                     "mgr_locs": blk0.mgr_locs.indexer},
                    ...]
     }
 })

I was aiming for the following design goals:

  • use the fact that pickle doesn't store same objects multiple times, i.e. adding an extra element to the tuple should add minimum overhead
  • make state as forward compatible as possible: new versions simply add new elements to that dictionary along with the old ones
  • make state as backward compatible as possible: no extra rules "if this then parse like that else ...", simply look up a supported version in the dictionary and use it
  • don't store classes that are subject to change, I almost got bitten with this one when tried to serialize blocks as a whole, 0.13.1 version, that had different (and incompatible) getstate implementation, croaked.

This PR should close #7329.

  • io pickle tests: verify 0.14.0 pickles are handled properly
  • ensure forward compatibility with 13.1
  • generate 14.1 pickles (for which platforms?)
  • ensure forward compatibility with 14.0
  • ensure forward compatibility with 12.0 ? not necessary
  • make sure sparse containers with non-unique items work in those versions too

@jreback
Copy link
Contributor

jreback commented Jun 6, 2014

this is breaking compat by making a new format which is incompatible with older versions

we don't change this format lightly (maybe should be changed but 0.15 would be better)

so see if can fix w/o changing the format now

see tests in io/tests/test_pickle.py

need to add duplicate items to the generation script and make test pickles for 0.14 that generate this error

@immerrr
Copy link
Contributor Author

immerrr commented Jun 6, 2014

Well, this is as simple as it could get, forward compatibility is tricky, especially if not kept in mind when a protocol is designed.

I suppose, if we dropped the "beta" protocol mentioned in the inline comment, we could store the dictionary as stated above in the 4-th element of the state tuple which, as of 0.14.0, is silently skipped with the first three being populated as before. Pickle is smart enough not to serialize same object multiple times, so it's only the metadata that will get duplicated...

@jreback
Copy link
Contributor

jreback commented Jun 6, 2014

what I mean is don't change the protocol at all
if in 0.15 we need to then that's different

@immerrr
Copy link
Contributor Author

immerrr commented Jun 6, 2014

Ah, I've lost the first paragraph of previous comment somewhere. What it said is that I don't see how equal items can be told from one another unless we change the protocol.

So, I guess, this PR should wait till 0.15 then.

@jreback
Copy link
Contributor

jreback commented Jun 6, 2014

why would you not catch the reindexing error and then just do this?

ipdb> p make_block(values,placement=np.arange(len(items)))._mgr_locs.as_array
array([0, 1, 2, 3])

I agree prob should change the format to be better as well. Pls add additional tests that mimic the user behavior (in io/pickle).

@jreback jreback added the Compat label Jun 6, 2014
@jreback jreback added this to the 0.14.1 milestone Jun 6, 2014
@immerrr
Copy link
Contributor Author

immerrr commented Jun 6, 2014

Hmm, this would only work for block managers with one blocks, right?

Writing duplicate-aware format into the extra tuple element and adding a couple of other tricks made pickling forward compatible with 0.12.0 which is the oldest version I have at hand. Which makes me wonder if there any guidelines on io deprecation, i.e. when ensuring forward compatibility which version should be the oldest to be supported?

@jreback
Copy link
Contributor

jreback commented Jun 6, 2014

if you can forward compat to 0.12 that's excellent (and good enough). the 0.12-0.13.0 pickle transition was difficult because of the subclassing, but the format didn't change per se (though was extended a bit for sparse types, which already pickle with a dict rather than tuples).

@immerrr
Copy link
Contributor Author

immerrr commented Jun 6, 2014

Ok, I'll add more tests and make sure sparse items are handled, too.

@jreback
Copy link
Contributor

jreback commented Jun 17, 2014

@immerrr how's this coming?

@immerrr
Copy link
Contributor Author

immerrr commented Jun 17, 2014

I got myself too deep into how do I implement consistent forward-compatibility checks.

Doing this once is easy, ofc, but I don't see an easy way to check that short of reinventing something vbench-ish, as in write a pickle with target revision, check out baseline revision, build it, unpickle the file and compare to etalon, and so on...

@jreback
Copy link
Contributor

jreback commented Jun 17, 2014

you can generate pickles for a version here: https://github.com/pydata/pandas/blob/master/pandas/io/tests/generate_legacy_pickles.py
Then put then in the legacy dir at which point they become tests

@immerrr
Copy link
Contributor Author

immerrr commented Jun 17, 2014

This is good for backward compatibility, yes, but how do I check that 0.13 haven't ceased to read current pickles?

@jreback
Copy link
Contributor

jreback commented Jun 17, 2014

well one time test is to do it in a virtual env

@immerrr
Copy link
Contributor Author

immerrr commented Jun 18, 2014

I did some testing, 0.13.1 works fine (doesn't croak, at least), as does 0.14.0. Alas, 0.12.0 fails to read Series (due to SingleBlockManager class). Is there a way around that?

Also, I cannot help with many of the platforms I find in pandas/io/tests/data/legacy_pickle, whom shall I refer to to add darwin, windows, and linux i686 variants?

@jreback
Copy link
Contributor

jreback commented Jun 18, 2014

I don't think forward compat with 0.12.0 is a big deal, IOW I don't expect 0.12.0 pandas to be able to read pickles generated in 0.13 + at all (its really just backward compat that we care about), though forward compat to 0.13.0 is nice too.

@immerrr
Copy link
Contributor Author

immerrr commented Jun 18, 2014

Ok, what about legacy pickles I can't easily create?

@jreback
Copy link
Contributor

jreback commented Jun 18, 2014

don't need to create them, they already exist. What are you trying to do?

@immerrr
Copy link
Contributor Author

immerrr commented Jun 18, 2014

Well the format is about to change slightly so I figured maybe it's worth double checking it won't break in the future.

I've also bumped the original message adding a rationale for the chosen format extension. If you're ok with both the rationale and the implementation I.'ll put up a release note tomorrow.

On Thu, Jun 19, 2014 at 12:41 AM, jreback [email protected]
wrote:

don't need to create them, they already exist. What are you trying to do?

Reply to this email directly or view it on GitHub:
#7370 (comment)

# discard anything after 3rd, support beta pickling format for a little
# while longer
ax_arrays, bvalues, bitems = state[:3]
extra_state = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you add a version tag instead, this seems kind of odd to do it this way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an "upcoming" version tag, as in the first stable version having this serialization format, serving as a key into this dictionary. Or do you mean something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I meant having the version as the key was odd, why not just as a key-value in the dict, e.g. version : '0.14.1'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured, something like this:

for ver in list(supported_versions):
    if ver in state:
        setstate_ver(state[ver])

would be easier on the eye than:

for ver in list(supported_versions):
     for d in state.values():
        if d['version'] == ver:
            setstate(d)
            break

But it's not a strong opinion, rather a gut feeling. If you insist, I'll make the version a dictionary element again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, either way is fine. Just trying to make it easy on future versions.

@jreback
Copy link
Contributor

jreback commented Jun 18, 2014

I think the format should be a dict. When I meant forwards compat I a) meant 'easy' to read / modify w/o breaking in the future, and backwards if possible by prior versions (e.g. >= 0.13.0), not sure if that is possible even.

Or maybe even not break it unless it is really necessary (e.g. no other way to do it).

@immerrr
Copy link
Contributor Author

immerrr commented Jun 19, 2014

0.13.0+ are able to read this "extended" format, but there is a difference which might need to be guarded against future breakage, which AFAIU is done via the legacy pickle tests. Or do I get that wrong?

@jreback
Copy link
Contributor

jreback commented Jun 19, 2014

the extended tests are really just to maintain back compat (unfortunately they don't cover EVERYTHING). I don't think their are duplicated indexers there. That said could generate some new pickles for particular versions (e.g. just run a 'new' version of the script in a virtual env). Its robust to adding things (just skips)

@immerrr
Copy link
Contributor Author

immerrr commented Jun 20, 2014

Ok, I'm confused.

Do you want me to add something to legacy pickle verification infrastructure or not?
What else would you like to see to get this merged?

- Support pickling ``Series``, ``DataFrame`` and ``Panel`` objects with
non-unique labels along *item* axis (``index``, ``columns`` and ``items``
respectively).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue reference here

@jreback
Copy link
Contributor

jreback commented Jun 20, 2014

minor comment

I think I'd like you to modify the legacy_pickle script. to add some examples of duplicated frames. then run and generate it for 0.13 (and add the generated pickle to the repo, on your platform is fine). It should then break master (w/o this fix), but work with it.

@immerrr
Copy link
Contributor Author

immerrr commented Jun 23, 2014

then run and generate it for 0.13 (and add the generated pickle to the repo, on your platform is fine). It should then break master (w/o this fix), but work with it.

I don't think it'll work with this fix for a pickle of dup-item object generated in 0.13 code, there will just be no placement information to use.

@jreback
Copy link
Contributor

jreback commented Jun 23, 2014

ok, should change the script going forward (egenerated pick) in any event; so when we make the pickles for 0.14.1 it will preserve pickle changes.

@jreback
Copy link
Contributor

jreback commented Jun 30, 2014

@immerrr ok....pls rebase....

@immerrr
Copy link
Contributor Author

immerrr commented Jul 1, 2014

@jreback, fixed, rebased & green

jreback added a commit that referenced this pull request Jul 1, 2014
ENH: change BlockManager pickle format to work with dup items
@jreback jreback merged commit eb7724e into pandas-dev:master Jul 1, 2014
@jreback
Copy link
Contributor

jreback commented Jul 1, 2014

thanks

@immerrr immerrr deleted the blockmanager-new-pickle-format branch July 1, 2014 10:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: v14.0 Error when unpickling DF with non-unique column multiindex
2 participants